Comparing Sequential Models: RNN, LSTM and GRU
For a long time, the deep learning community has considered LSTMs and GRUs superior to vanilla RNNs, thanks to their architectural advancements. But how much better are they, really? Today, we're putting these three fundamental recurrent neural networks to the test:
- RNN (Recurrent Neural Network)
- LSTM (Long Short-Term Memory)
- GRU (Gated Recurrent Unit)
We will compare them on a challenging and relevant task, with no pre-processing to smooth the way, just raw power: predicting the price of Ethereum (ETH), a notoriously volatile cryptocurrency.
Why Were LSTM and GRU Introduced When We Already Had RNNs?
Before diving into the comparison, it's crucial to understand why LSTMs and GRUs were created in the first place. Simple RNNs suffer from a major flaw known as the vanishing gradient problem.
In essence, when an RNN processes long sequences of data (like many days of stock prices), the information from the initial steps gets diluted. As the network learns, the signal (or "gradient") used to update the model's weights shrinks exponentially over time. By the end of the sequence, the gradient is so small that the network can't learn from the early data. This means a simple RNN might forget the price from a month ago, even if it's crucial for predicting tomorrow's price.
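To see this in action, here is a minimal sketch (my own illustration, assuming PyTorch; the layer sizes are arbitrary) that measures the gradient flowing back to the very first time step as the sequence grows. In a vanilla RNN, that gradient typically shrinks rapidly with sequence length:

```python
import torch
import torch.nn as nn

# Illustrative sketch: how much gradient reaches the first time step
# of a vanilla RNN as the sequence gets longer?
torch.manual_seed(0)
rnn = nn.RNN(input_size=1, hidden_size=16, batch_first=True)

for seq_len in (5, 30, 100):
    x = torch.randn(1, seq_len, 1, requires_grad=True)
    _, h_n = rnn(x)                      # final hidden state
    h_n.sum().backward()                 # backprop through the whole sequence
    first_step_grad = x.grad[0, 0].abs().item()
    print(f"len={seq_len:3d}  |dL/dx_0| = {first_step_grad:.2e}")
```

With a randomly initialized network, the printed magnitude usually drops by orders of magnitude as `seq_len` increases, which is exactly the vanishing gradient problem described above.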
LSTMs and GRUs were designed specifically to solve this. They introduce gates: internal mechanisms that regulate the flow of information, allowing the network to "remember" important data over long periods and "forget" irrelevant data.
LSTMs have three gates: an input gate (decides what new information to store), a forget gate (decides what information to discard), and an output gate (decides what to output based on the stored memory).
GRUs are a simplified version with two gates: a reset gate (decides how much past information to forget) and an update gate (decides how much of the new input to let through).
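These gates show up directly in the parameter counts. A sketch (assuming PyTorch; the sizes are arbitrary): an RNN layer holds one set of input/hidden weight matrices, a GRU three (one per gate/candidate), and an LSTM four (three gates plus the cell candidate), so their parameter counts scale by exactly 1x, 3x, and 4x:

```python
import torch.nn as nn

# Parameter counts for one recurrent layer with identical sizes.
# RNN keeps 1 set of weights, GRU 3 (reset, update, candidate),
# LSTM 4 (input, forget, output gates + cell candidate).
hidden = 64
for name, cls in [("RNN", nn.RNN), ("GRU", nn.GRU), ("LSTM", nn.LSTM)]:
    model = cls(input_size=8, hidden_size=hidden, num_layers=1)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name:4s}: {n_params:6d} parameters")
```

The extra gates are what let LSTMs and GRUs preserve information over long sequences, at the cost of a larger model.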
Back to our Task
For our experiment, we're using historical daily data for Ethereum from CoinLore. The dataset includes the Date, Open, High, Low, Close, Volume, Volume(ETH), and Market Cap. Our data spans from August 7, 2015, to August 16, 2025, for a total of 3,634 daily observations. While a larger dataset is often better, working with this amount of data is a realistic test for our models.
Following is a preview of our raw data...
| | Date | Open | High | Low | Close | Volume | Volume(ETH) | Market Cap |
|---|---|---|---|---|---|---|---|---|
| 0 | August 16 2025 | $4441 | $4488 | $4379 | $4423 | $19.6 bn | 4433020 | $534.4 bn |
| 1 | August 15 2025 | $4548 | $4657 | $4382 | $4426 | $38.3 bn | 8440324 | $548.5 bn |
| 2 | August 14 2025 | $4751 | $4784 | $4465 | $4553 | $54 bn | 11610229 | $561.7 bn |
| 3 | August 13 2025 | $4591 | $4773 | $4571 | $4760 | $45.4 bn | 9716317 | $564.2 bn |
| 4 | August 12 2025 | $4224 | $4621 | $4224 | $4594 | $42.8 bn | 9759438 | $529.7 bn |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3629 | August 11 2015 | $0.7081 | $1.13 | $0.6632 | $1.07 | $1.5 m | 1638721 | $53.9 m |
| 3630 | August 10 2015 | $0.7140 | $0.7299 | $0.6365 | $0.7084 | $405.3 K | 581293 | $42.1 m |
| 3631 | August 9 2015 | $0.7061 | $0.8798 | $0.6292 | $0.7019 | $532.2 K | 729741 | $44.1 m |
| 3632 | August 8 2015 | $2.79 | $2.80 | $0.7147 | $0.7533 | $674.2 K | 382082 | $106.5 m |
| 3633 | August 7 2015 | $2.83 | $3.54 | $2.52 | $2.77 | $164.3 K | 56374 | $175.3 m |
3634 rows × 8 columns
As you may have noticed, the raw data requires some cleaning. After converting the string values to integers and floats, we are ready to transform our clean data into the format our models understand.
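A minimal sketch of that cleaning step (my own illustration, not the exact code used here), assuming the raw strings look like the preview above, e.g. `"$4423"`, `"$19.6 bn"`, `"$405.3 K"`:

```python
import pandas as pd

# Suffix multipliers assumed from the preview table ("bn", "m", "K").
MULTIPLIERS = {"bn": 1e9, "m": 1e6, "k": 1e3}

def to_number(value: str) -> float:
    """Convert a formatted string like '$19.6 bn' to a plain float."""
    s = value.replace("$", "").replace(",", "").strip().lower()
    for suffix, mult in MULTIPLIERS.items():
        if s.endswith(suffix):
            return float(s[: -len(suffix)]) * mult
    return float(s)

# Tiny demo frame mimicking two of the raw columns.
df = pd.DataFrame({"Close": ["$4423", "$0.7084"], "Volume": ["$19.6 bn", "$405.3 K"]})
for col in ("Close", "Volume"):
    df[col] = df[col].map(to_number)
print(df.dtypes)
```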
(Important: prices and volumes are on vastly different scales. Even so, we will not perform any normalization; we're letting our models go wild on the raw values.)
Next, we structure the data into sequences. For our scenario, we use the data from the last 30 days to predict the price for the 31st day. This creates the input-output pairs our models will learn from. The following figure explains the transformation clearly.
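The windowing itself can be sketched in a few lines (a simplified illustration using a synthetic price array; the real pipeline would pass in the cleaned Close column):

```python
import numpy as np

def make_sequences(prices: np.ndarray, window: int = 30):
    """Build (X, y) pairs: `window` past days as input, the next day as target."""
    X = np.stack([prices[i : i + window] for i in range(len(prices) - window)])
    y = prices[window:]
    return X, y

prices = np.arange(100, dtype=np.float32)  # stand-in for the Close column
X, y = make_sequences(prices, window=30)
print(X.shape, y.shape)  # -> (70, 30) (70,)
```

Each row of `X` is a 30-day slice, and the matching entry of `y` is the price on the day that immediately follows it.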
Once again, I highly recommend referring to Chris Colah's blog post Understanding LSTM Networks; Daniel Voigt Godoy's diagrams are also great for understanding the internal mechanisms of RNNs and LSTMs.
Model Evaluation and Comparison
Now, let's skip the training details and jump straight to the results. We will evaluate our models using common regression metrics like Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE). Both measure the average error in the model's predictions, but RMSE penalizes larger errors more heavily. In stock prediction, a lower RMSE and MAE mean the model's price predictions are closer to the actual prices.
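For reference, both metrics are a couple of lines each (a small self-contained sketch with made-up numbers, just to show that RMSE exceeds MAE when errors vary in size):

```python
import numpy as np

def rmse(y_true, y_pred) -> float:
    """Root Mean Squared Error: penalizes large errors more heavily."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

def mae(y_true, y_pred) -> float:
    """Mean Absolute Error: average size of the errors."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

actual    = [4423.0, 4426.0, 4553.0]   # hypothetical Close prices
predicted = [4400.0, 4460.0, 4500.0]   # hypothetical model outputs
print(f"RMSE = {rmse(actual, predicted):.2f}, MAE = {mae(actual, predicted):.2f}")
```

Because the squared term amplifies the largest miss (53 here), RMSE comes out higher than MAE on the same predictions.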
Stacked Layers Comparison
Stacking recurrent layers is a powerful technique. A single-layer model might learn simple patterns, but a multi-layered (or "deep") model can learn more complex, hierarchical patterns in the data. The first layer might learn short-term trends, while subsequent layers learn from the output of the previous ones to identify more abstract, long-term patterns.
So, first we will compare the performance improvements from stacking layers in our sequential models. For each model, we iterate over the number of initially stacked recurrent layers, followed by a linear layer. To see the overall trend, we use 1, 2, 3, 4, 6, and 8 stacked layers for each model, trained for 2,000 epochs.
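A sketch of how such a family of models can be generated (assuming PyTorch; the hidden size and training loop are placeholders, not the exact settings used in the experiment):

```python
import torch
import torch.nn as nn

class SeqModel(nn.Module):
    """Recurrent backbone (RNN/LSTM/GRU) with `num_layers` stacked layers,
    followed by a linear head that maps the last hidden state to one price."""
    def __init__(self, cell: type, num_layers: int, hidden: int = 64):
        super().__init__()
        self.backbone = cell(input_size=1, hidden_size=hidden,
                             num_layers=num_layers, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):
        out, _ = self.backbone(x)        # out: (batch, seq_len, hidden)
        return self.head(out[:, -1, :])  # predict from the last time step

# One model per (architecture, depth) combination in the experiment grid.
for cell in (nn.RNN, nn.LSTM, nn.GRU):
    for layers in (1, 2, 3, 4, 6, 8):
        model = SeqModel(cell, layers)
        # ...train for 2,000 epochs and record RMSE/MAE here...
```

Using `num_layers` lets PyTorch feed each layer's hidden-state sequence directly into the next, which is the "standard stacking" the next section contrasts against.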
The results clearly show a few key trends:
Diminishing Returns: The improvement from 4 to 8 layers is likely less pronounced and may even lead to overfitting, where the model learns the training data too well and performs poorly on new, unseen data. The increased training time and complexity might not be worth the marginal gain.
On the whole, the RNN does not perform badly compared to the LSTM and GRU; in fact, it beat them in some scenarios. (Possible reasons include our data's scale, variance, non-stationarity, and the number of samples.)
Architecture Difference Comparison
We know that recurrent layers are better suited to sequences than fully connected ones, but the question is: by how much? So, what happens if we insert a linear (or Dense) layer between our stacked recurrent layers?
Standard Stacking:
LSTM -> LSTM -> Linear

Tinkered Architecture:
LSTM -> Linear -> LSTM -> Linear
Adding a linear layer between recurrent layers is an unconventional approach. Here's the likely impact:
A linear layer's primary job is to perform a linear transformation on the data (output = weights * input + bias). When placed between recurrent layers, it can disrupt the flow of sequential information. The hidden state passed from the first LSTM layer, which is rich with temporal context, gets transformed into a new space by the linear layer. The second LSTM layer then has to learn the sequential patterns from this transformed data, which can be less efficient or even detrimental.
In most cases, this architecture will not increase accuracy. The strength of stacked RNNs comes from their ability to build hierarchical temporal representations directly on top of each other. Inserting a linear layer breaks this direct temporal hierarchy. However, it's an interesting experiment that demonstrates the importance of maintaining the integrity of sequential processing in deep recurrent models.
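The tinkered variant described above can be sketched like this (an illustrative PyTorch implementation under my own assumptions about sizes; the intermediate linear layer is applied to every time step's hidden state before the second LSTM sees it):

```python
import torch
import torch.nn as nn

class TinkeredLSTM(nn.Module):
    """LSTM -> Linear -> LSTM -> Linear: a linear layer is inserted
    between the two recurrent layers, breaking the direct temporal hierarchy."""
    def __init__(self, hidden: int = 64):
        super().__init__()
        self.lstm1 = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.mid = nn.Linear(hidden, hidden)   # transforms each time step's state
        self.lstm2 = nn.LSTM(input_size=hidden, hidden_size=hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, x):
        h, _ = self.lstm1(x)       # (batch, seq_len, hidden)
        h = self.mid(h)            # linear remap of the temporal features
        h, _ = self.lstm2(h)       # second LSTM sees the transformed sequence
        return self.out(h[:, -1, :])
```

Compared with the standard stack, the only change is `self.mid`; everything the second LSTM learns now passes through that extra linear remapping.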
So, the results were as we expected: adding an extra FC layer between two stacked layers does not yield better results; in fact, in some cases (RNN and GRU) it performed poorly compared to the other models.
Conclusion and Final Thoughts
We were not able to draw any concrete conclusions; model selection still depends heavily on the complexity of the data.
Stacking recurrent layers is an effective strategy to improve model performance by learning more complex patterns, though with diminishing returns and a risk of overfitting.
Architectural choices matter. Sticking to conventional designs, like stacking recurrent layers directly, is usually the most effective approach for processing sequential data.
While GRUs often train faster and perform on par with LSTMs, the choice between them can depend on the specific dataset and computational resources.
References
- Colah's Blog : https://colah.github.io/posts/2015-08-Understanding-LSTMs/
- Daniel Voigt Godoy : https://github.com/dvgodoy/dl-visuals